Delve into the intricate world of fashion commerce with our dataset, 'Wardrobe Economics: Examining Sales, Categories, and Geography for Products (Clothes and Accessories'). This comprehensive collection of data opens the door to a wealth of insights waiting to be discovered. Gain a profound understanding of consumer preferences and buying behavior as you immerse yourself in the intricate details of sales transactions, diverse product categories, and the geographical spread of fashion trends.
Unveil the hidden narratives behind each transaction, revealing the ebb and flow of consumer demands across a spectrum of stylish offerings. Delve into the intricacies of product categories, allowing you to identify burgeoning trends and enduring classics that define the ever-evolving world of fashion. Explore the geographical distribution of these trends, shedding light on regional nuances and preferences that shape the global fashion ecosystem.
With 'Wardrobe Economics,' you'll be empowered to conduct in-depth analyses that illuminate the path to success in the realm of clothing and accessories. Whether you're a trend-savvy retailer, a market researcher, or a curious enthusiast, this dataset beckons you to unravel its insights and unlock a deeper appreciation for the intricate interplay of sales, categories, and geography within the realm of fashion.
!pip install fasteda
!pip install datacleaner
#for eda
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import numpy as np
from fasteda import fast_eda
from datacleaner import autoclean
import scipy
import scipy.stats as stats
df = pd.read_csv(r'C:\Users\manht\Documents\formation-python\Clothes and Accessories Sales and Product Details\Sales_Product_Details.csv')
df.head()
| Date | Customer_ID | Product_ID | Quantity | Unit_Price | Sales_Revenue | Product_Description | Product_Category | Product_Line | Raw_Material | Region | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20210601 | 98 | 321 | 1 | 117.306016 | 117.306016 | Cycling Jerseys | Sports | Tops | Fabrics | York | 53.958332 | -1.080278 |
| 1 | 20210602 | 92 | 261 | 4 | 32.272403 | 129.089613 | Casual Shirts | Menswear | Tops | Cotton | Worcester | 52.192001 | -2.220000 |
| 2 | 20210603 | 92 | 264 | 1 | 36.193364 | 36.193364 | Casual Shirts | Menswear | Tops | Cotton | Worcester | 52.192001 | -2.220000 |
| 3 | 20210604 | 99 | 251 | 3 | 29.913403 | 89.740210 | Jeans | Menswear | Trousers | Cotton | Winchester | 51.063202 | -1.308000 |
| 4 | 20210605 | 66 | 251 | 1 | 41.843430 | 41.843430 | Shorts | Womenswear | Trousers | Cotton | Winchester | 51.063202 | -1.308000 |
df.shape
(30, 13)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 30 entries, 0 to 29 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 30 non-null int64 1 Customer_ID 30 non-null int64 2 Product_ID 30 non-null int64 3 Quantity 30 non-null int64 4 Unit_Price 30 non-null float64 5 Sales_Revenue 30 non-null float64 6 Product_Description 30 non-null object 7 Product_Category 30 non-null object 8 Product_Line 30 non-null object 9 Raw_Material 30 non-null object 10 Region 30 non-null object 11 Latitude 30 non-null float64 12 Longitude 30 non-null float64 dtypes: float64(4), int64(4), object(5) memory usage: 3.2+ KB
df.dtypes
Date int64 Customer_ID int64 Product_ID int64 Quantity int64 Unit_Price float64 Sales_Revenue float64 Product_Description object Product_Category object Product_Line object Raw_Material object Region object Latitude float64 Longitude float64 dtype: object
df.describe()
| Date | Customer_ID | Product_ID | Quantity | Unit_Price | Sales_Revenue | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|
| count | 3.000000e+01 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 | 30.000000 |
| mean | 2.021062e+07 | 57.400000 | 279.733333 | 2.066667 | 40.498759 | 79.687953 | 52.237571 | -2.270437 |
| std | 8.803408e+00 | 31.457251 | 24.259990 | 1.142693 | 16.742578 | 43.008559 | 1.449567 | 1.367443 |
| min | 2.021060e+07 | 9.000000 | 251.000000 | 1.000000 | 21.965812 | 21.965812 | 50.259998 | -5.051000 |
| 25% | 2.021061e+07 | 32.500000 | 262.250000 | 1.000000 | 32.390679 | 36.774078 | 51.063202 | -2.647000 |
| 50% | 2.021062e+07 | 51.000000 | 276.000000 | 2.000000 | 36.191911 | 79.261696 | 52.192001 | -1.490000 |
| 75% | 2.021062e+07 | 90.750000 | 287.500000 | 3.000000 | 44.341442 | 113.761987 | 53.680000 | -1.353500 |
| max | 2.021063e+07 | 99.000000 | 357.000000 | 4.000000 | 117.306016 | 175.486148 | 53.958332 | -1.080278 |
df.isnull().sum()
Date 0 Customer_ID 0 Product_ID 0 Quantity 0 Unit_Price 0 Sales_Revenue 0 Product_Description 0 Product_Category 0 Product_Line 0 Raw_Material 0 Region 0 Latitude 0 Longitude 0 dtype: int64
df.duplicated().sum()
0
df = autoclean(df)
df.head()
| Date | Customer_ID | Product_ID | Quantity | Unit_Price | Sales_Revenue | Product_Description | Product_Category | Product_Line | Raw_Material | Region | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20210601 | 98 | 321 | 1 | 117.306016 | 117.306016 | 3 | 2 | 2 | 2 | 5 | 53.958332 | -1.080278 |
| 1 | 20210602 | 92 | 261 | 4 | 32.272403 | 129.089613 | 1 | 1 | 2 | 1 | 4 | 52.192001 | -2.220000 |
| 2 | 20210603 | 92 | 264 | 1 | 36.193364 | 36.193364 | 1 | 1 | 2 | 1 | 4 | 52.192001 | -2.220000 |
| 3 | 20210604 | 99 | 251 | 3 | 29.913403 | 89.740210 | 7 | 1 | 3 | 1 | 3 | 51.063202 | -1.308000 |
| 4 | 20210605 | 66 | 251 | 1 | 41.843430 | 41.843430 | 12 | 3 | 3 | 1 | 3 | 51.063202 | -1.308000 |
fast_eda(df)
DataFrame Head:
| Date | Customer_ID | Product_ID | Quantity | Unit_Price | Sales_Revenue | Product_Description | Product_Category | Product_Line | Raw_Material | Region | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20210601 | 98 | 321 | 1 | 117.306016 | 117.306016 | 3 | 2 | 2 | 2 | 5 | 53.958332 | -1.080278 |
| 1 | 20210602 | 92 | 261 | 4 | 32.272403 | 129.089613 | 1 | 1 | 2 | 1 | 4 | 52.192001 | -2.220000 |
| 2 | 20210603 | 92 | 264 | 1 | 36.193364 | 36.193364 | 1 | 1 | 2 | 1 | 4 | 52.192001 | -2.220000 |
DataFrame Tail:
| Date | Customer_ID | Product_ID | Quantity | Unit_Price | Sales_Revenue | Product_Description | Product_Category | Product_Line | Raw_Material | Region | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 20210628 | 56 | 262 | 1 | 33.470936 | 33.470936 | 5 | 1 | 2 | 5 | 5 | 53.958332 | -1.080278 |
| 28 | 20210629 | 13 | 286 | 1 | 32.745507 | 32.745507 | 8 | 3 | 2 | 0 | 2 | 51.209000 | -2.647000 |
| 29 | 20210630 | 91 | 291 | 1 | 31.879107 | 31.879107 | 8 | 1 | 2 | 0 | 2 | 51.209000 | -2.647000 |
----------------------------------------------------------------------------------------------------
Missing values:
| 0 |
|---|
---------------------------------------------------------------------------------------------------- Shape of DataFrame: (30, 13) ---------------------------------------------------------------------------------------------------- DataFrame Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 30 entries, 0 to 29 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 30 non-null int64 1 Customer_ID 30 non-null int64 2 Product_ID 30 non-null int64 3 Quantity 30 non-null int64 4 Unit_Price 30 non-null float64 5 Sales_Revenue 30 non-null float64 6 Product_Description 30 non-null int32 7 Product_Category 30 non-null int32 8 Product_Line 30 non-null int32 9 Raw_Material 30 non-null int32 10 Region 30 non-null int32 11 Latitude 30 non-null float64 12 Longitude 30 non-null float64 dtypes: float64(4), int32(5), int64(4) memory usage: 2.6 KB ---------------------------------------------------------------------------------------------------- Describe DataFrame:
| count | mean | median | std | min | 25% | 50% | 75% | max | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | 30 | 20210615.5 | 20210615.5 | 8.803 | 20210601 | 20210608.25 | 20210615.5 | 20210622.75 | 20210630 | 0.000000 | -1.202670 |
| Customer_ID | 30 | 57.4 | 51 | 31.457 | 9 | 32.5 | 51 | 90.75 | 99 | -0.045345 | -1.382609 |
| Product_ID | 30 | 279.733 | 276 | 24.26 | 251 | 262.25 | 276 | 287.5 | 357 | 1.527557 | 2.302047 |
| Quantity | 30 | 2.067 | 2 | 1.143 | 1 | 1 | 2 | 3 | 4 | 0.432925 | -1.324588 |
| Unit_Price | 30 | 40.499 | 36.192 | 16.743 | 21.966 | 32.391 | 36.192 | 44.341 | 117.306 | 3.305144 | 12.956859 |
| Sales_Revenue | 30 | 79.688 | 79.262 | 43.009 | 21.966 | 36.774 | 79.262 | 113.762 | 175.486 | 0.370341 | -0.994856 |
| Product_Description | 30 | 8.133 | 8 | 4.531 | 0 | 5 | 8 | 11 | 17 | 0.063229 | -0.719756 |
| Product_Category | 30 | 1.867 | 1.5 | 1.074 | 0 | 1 | 1.5 | 3 | 3 | -0.071921 | -1.578642 |
| Product_Line | 30 | 2.067 | 2 | 0.583 | 0 | 2 | 2 | 2 | 3 | -1.057225 | 4.189920 |
| Raw_Material | 30 | 1.9 | 1 | 1.647 | 0 | 1 | 1 | 3 | 5 | 0.820911 | -0.693793 |
| Region | 30 | 2.1 | 2 | 1.668 | 0 | 1 | 2 | 3 | 5 | 0.474169 | -0.995923 |
| Latitude | 30 | 52.238 | 52.192 | 1.45 | 50.26 | 51.063 | 52.192 | 53.68 | 53.958 | -0.036826 | -1.677367 |
| Longitude | 30 | -2.27 | -1.49 | 1.367 | -5.051 | -2.647 | -1.49 | -1.353 | -1.08 | -1.277935 | 0.201208 |
----------------------------------------------------------------------------------------------------
DataFrame Correlation:
----------------------------------------------------------------------------------------------------
DataFrame Pairplot:
----------------------------------------------------------------------------------------------------
Histogram(s) & Boxplot(s):
----------------------------------------------------------------------------------------------------
Countplot(s):
for column in df.columns :
plt.figure(figsize = (10,4))
plt.subplot(121)
sns.histplot(df[column])
plt.title(column)
plt.subplot(122)
stats.probplot(df[column],dist = 'norm', plot = plt)
plt.title(column)
plt.show()
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(output_distribution='normal')
for col in df.columns:
df[col] = df[col] = qt.fit_transform(pd.DataFrame(df[col]))
for column in df.columns :
plt.figure(figsize = (10,4))
plt.subplot(121)
sns.histplot(df[column])
plt.title(column)
plt.subplot(122)
stats.probplot(df[column],dist = 'norm', plot = plt)
plt.title(column)
plt.show()
for column in df.columns :
plt.figure(figsize = (10,4))
plt.subplot(121)
sns.histplot(df[column])
plt.title(column)
plt.subplot(122)
stats.probplot(df[column],dist = 'norm', plot = plt)
plt.title(column)
plt.show()
for column in df:
plt.figure(figsize=(17,1))
sns.boxplot(data = df, x = column)
for col in df:
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
whisker_width = 1.5
lower_whisker = q1 - (whisker_width * iqr)
upper_whisker = q3 + whisker_width * iqr
df[col] = np.where(df[col] > upper_whisker, upper_whisker, np.where(df[col] < lower_whisker, lower_whisker, df[col]))
for column in df:
plt.figure(figsize=(17,1))
sns.boxplot(data = df, x = column)
df.drop("Product_Line", axis = 1, inplace = True)
df.head()
| Date | Customer_ID | Product_ID | Quantity | Unit_Price | Sales_Revenue | Product_Description | Product_Category | Raw_Material | Region | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.700796 | 1.628361 | 1.483540 | -5.199338 | 2.700796 | 0.817237 | -1.089662 | 0.086543 | 0.399323 | 2.103162 | 2.625385 | 2.837372 |
| 1 | -1.818646 | 0.879168 | -0.817237 | 5.199338 | -0.701873 | 1.089662 | -1.628361 | -0.595179 | -0.307293 | 0.944670 | 0.043231 | -0.307293 |
| 2 | -1.483540 | 0.879168 | -0.494873 | -5.199338 | 0.043231 | -0.701873 | -1.628361 | -0.595179 | -0.307293 | 0.944670 | 0.043231 | -0.307293 |
| 3 | -1.262145 | 2.700796 | -2.681732 | 0.647604 | -0.944670 | 0.217798 | -0.307293 | -0.595179 | -0.307293 | 0.544341 | -0.758293 | 0.879168 |
| 4 | -1.089662 | 0.262283 | -2.681732 | -5.199338 | 0.494873 | -0.494873 | 0.879168 | 5.199338 | -0.307293 | 0.544341 | -0.758293 | 0.879168 |
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from tabulate import tabulate
# Extract features and target
X = df.drop("Sales_Revenue", axis = 1)
Y = df["Sales_Revenue"]
# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
regLR = LinearRegression()
regLR.fit(X_train,Y_train)
yPredRegLR=regLR.predict(X_test)
mean_squared_error(yPredRegLR, Y_test)
0.4044664482334006
regSVR=SVR(kernel='linear')
regSVR.fit(X_train,Y_train)
yPredRegSVR=regSVR.predict(X_test)
mean_squared_error(yPredRegSVR, Y_test)
0.3142226059467846
regDTR=DecisionTreeRegressor()
regDTR.fit(X_train,Y_train)
yPredRegDTR=regDTR.predict(X_test)
mean_squared_error(yPredRegDTR, Y_test)
0.8018579611142632
regRFR=RandomForestRegressor()
regRFR.fit(X_train,Y_train)
yPredRegRFR=regRFR.predict(X_test)
mean_squared_error(yPredRegRFR, Y_test)
1.0856999572521733
data=[
["Linear Regression",round(mean_absolute_error(yPredRegLR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegLR, Y_test)),2),round(r2_score(yPredRegLR, Y_test),2)],
["Support Vector Regression",round(mean_absolute_error(yPredRegSVR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegSVR, Y_test)),2),round(r2_score(yPredRegSVR, Y_test),2)],
["Decision Tree Regression",round(mean_absolute_error(yPredRegDTR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegDTR, Y_test)),2),round(r2_score(yPredRegDTR, Y_test),2)],
["Random Forest Regression",round(mean_absolute_error(yPredRegRFR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegRFR, Y_test)),2),round(r2_score(yPredRegRFR, Y_test),2)]
]
columns=["Model Name","Mean Absolute Error","Root Mean Squared Error","R Squared Error"]
print(tabulate(data, headers=columns, tablefmt="fancy_grid"))
╒═══════════════════════════╤═══════════════════════╤═══════════════════════════╤═══════════════════╕ │ Model Name │ Mean Absolute Error │ Root Mean Squared Error │ R Squared Error │ ╞═══════════════════════════╪═══════════════════════╪═══════════════════════════╪═══════════════════╡ │ Linear Regression │ 0.52 │ 0.64 │ 0.49 │ ├───────────────────────────┼───────────────────────┼───────────────────────────┼───────────────────┤ │ Support Vector Regression │ 0.47 │ 0.56 │ 0.59 │ ├───────────────────────────┼───────────────────────┼───────────────────────────┼───────────────────┤ │ Decision Tree Regression │ 0.8 │ 0.9 │ -1.41 │ ├───────────────────────────┼───────────────────────┼───────────────────────────┼───────────────────┤ │ Random Forest Regression │ 0.91 │ 1.04 │ -0.71 │ ╘═══════════════════════════╧═══════════════════════╧═══════════════════════════╧═══════════════════╛
# Create a DataFrame with actual and predicted values
results_df = pd.DataFrame({
'Actual Values': Y_test.values,
'Predicted Values': yPredRegRFR})
results_df.head(10)
| Actual Values | Predicted Values | |
|---|---|---|
| 0 | -0.043231 | 0.801596 |
| 1 | 0.043231 | 0.593894 |
| 2 | 0.817237 | -0.754413 |
| 3 | 1.483540 | 0.781099 |
| 4 | -0.399323 | -0.614952 |
| 5 | -0.130019 | 1.447078 |
# Plotting the linear regression line
plt.scatter(Y_test, yPredRegRFR, alpha=0.7, label='Predicted')
plt.plot([min(Y_test), max(Y_test)], [min(Y_test), max(Y_test)], linestyle='--', color='red', label='Perfect Prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values')
plt.legend()
plt.show()